To start off, the necessary libraries are imported such as numpy, pandas, matplotlib, holoviews, panel and seaborn.
#data analysis:
import pandas as pd
import numpy as np
#visualisation:
import seaborn as sns
import matplotlib.pyplot as plt
#machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# visualisation tools for maps
import folium
from folium.plugins import MarkerCluster
# iteration tool for loops
import itertools
# for geo location
from geopy import geocoders
import requests
import googlemaps
Here we explore the data from the csv file
# Read the redwine.csv file into a Pandas dataframe
df = pd.read_csv('redwine.csv',index_col = 0,sep=';',encoding ='ISO-8859-1')
df.columns = [c.replace(' ', '_') for c in df.columns]
# Sample 5 random rows of the dataframe
df.sample(5)
| country | description | designation | points | price | province | taster_name | title | variety | winery | ... | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 2016 | Portugal | This is a fruity wine housing soft tannins and... | Visconde de Borba | 84 | 7.0 | Alentejo | Virginie Boone | Marcolino Sebo 2013 Visconde de Borba Red (Ale... | Portuguese Red | Marcolino Sebo | ... | 0.43 | 0.32 | 2.8 | 0.080 | 29.0 | 58.0 | 0.9974 | 3.31 | 0.64 | 10.3 |
| 2434 | Portugal | This is a bold, ripe wine from the heart of th... | Monte Alentejano | 85 | 9.0 | Alentejano | Virginie Boone | DFJ Vinhos 2014 Monte Alentejano Red (Alentejano) | Portuguese Red | DFJ Vinhos | ... | 0.68 | - | 2.2 | 0.073 | 12.0 | 22.0 | 0.9969 | 3.48 | 0.50 | 9.3 |
| 1067 | Portugal | It's a ripe wood-aged wine, showing a full bod... | Tinto | 90 | 35.0 | Alentejano | Joe Czerwinski | Altas Quintas 2007 Tinto Red (Alentejano) | Portuguese Red | Altas Quintas | ... | 0.39 | 0.23 | 8.3 | 0.051 | 28.0 | 91.0 | 0.9952 | 3.44 | 0.55 | 12.1 |
| 258 | Portugal | The structure is the thing here, a square, fir... | Colheita | 90 | 19.0 | Dão | Paul Gregutt | Vinha Paz 2008 Colheita Red (Dão) | Portuguese Red | Vinha Paz | ... | 0.62 | 0.05 | 1.9 | 0.068 | 24.0 | 42.0 | 0.9961 | 3.42 | 0.57 | 11.5 |
| 1244 | Portugal | This wine is all about black and red berry fru... | Casa Américo | 86 | NaN | Dão | Virginie Boone | Seacampo 2013 Casa Américo Red (Dão) | Portuguese Red | Seacampo | ... | 0.53 | 0.33 | 2.4 | 0.080 | 24.0 | 144.0 | 0.99655 | 3.30 | 0.60 | 9.5 |
5 rows × 21 columns
# Read the csv file with geodata into a Pandas Dataframe
geo = pd.read_csv('pt_infosolo.csv',sep=',',encoding ='ISO-8859-1')
# Sample 3 random rows of the geo-dataframe
geo.sample(3)
#For us the coordinate columns (longitude and latitude) and the "soil" column are the most interesting ones.
| X | Y | pt_infosolo_site_id | pt_infosolo_horizon_id | profile_code | longitude | latitude | altitude | coordinates | year | ... | k_ex | na_ex | cations_m | cec | cec_m | v | theta_fc | theta_wp | theta_m | institution | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 233 | -7.259258 | 39.89396 | 70 | 232 | R1 | -7.259258 | 39.89396 | 175 | Available | 2004 | ... | 0.06 | 0.01 | Not discriminated (Ammonium acetate at pH 7.0 ... | 2.63 | Not discriminated (Ammonium acetate at pH 7.0 ... | 56.3 | 0.1533 | 0.0472 | Pedotransfer functions for estimating soil wat... | Direcção Geral de Agricultura e do Desenvolv... |
| 2680 | -8.049576 | 41.59603 | 790 | 2674 | 756K | -8.049576 | 41.59603 | 650 | Estimated | 1996 | ... | 0.05 | 0.09 | Not discriminated (Ammonium acetate at pH 7.0 ... | 5.88 | Not discriminated (Ammonium acetate at pH 7.0 ... | 33.0 | 0.1785 | 0.0573 | Pedotransfer functions for estimating soil wat... | Direcção Regional de Agricultura e Pescas do... |
| 8131 | -7.182450 | 41.79227 | 2644 | 8129 | 377Q | -7.182450 | 41.79227 | 480 | Estimated | 1991 | ... | 0.01 | 0.01 | Ammonium acetate at pH 7.0 | 7.10 | Ammonium acetate at pH 7.0 | 2.2 | 0.2645 | 0.1095 | Pedotransfer functions for estimating soil wat... | Universidade de Trás-os-Montes e Alto Douro |
3 rows × 52 columns
# Show all columns of the data set
df.columns
Index(['country', 'description', 'designation', 'points', 'price', 'province',
'taster_name', 'title', 'variety', 'winery', 'fixed_acidity',
'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides',
'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH',
'sulphates', 'alcohol'],
dtype='object')
To make this data more visible we put the column names in a csv file with the column type and data type
pd.read_csv('column_types.csv',sep=';')
| Column name | Column type | Actual value type | Desired value types | Dependency | Level of measurement | |
|---|---|---|---|---|---|---|
| 0 | country | Categorical | String | String | Independent | Nominal |
| 1 | description | Alphanumerical | String | String | Independent | Nominal |
| 2 | designation | Categorical | String | String | Independent | Nominal |
| 3 | points | Numerical | Int | Int | Dependent | Interval |
| 4 | price | Numerical | Float | Float | Dependent | Ratio |
| 5 | province | Categorical | String | String | Independent | Nominal |
| 6 | taster_name | Alphanumerical | String | String | Independent | Nominal |
| 7 | title | Alphanumerical | String | String | Independent | Nominal |
| 8 | variety | Categorical | String | String | Dependent | Nominal |
| 9 | winery | Categorical | String | String | Dependent | Nominal |
| 10 | fixed acidity | Numerical | Float | Float | Dependent | Ratio |
| 11 | volatile acidity | Numerical | Float | Float | Dependent | Ratio |
| 12 | citric acid | Numerical | String | Float | Dependent | Ratio |
| 13 | residual sugar | Numerical | Float | Float | Dependent | Ratio |
| 14 | chlorides | Numerical | Float | Float | Dependent | Ratio |
| 15 | free sulfur dioxide | Numerical | Float | Float | Dependent | Ratio |
| 16 | total sulfur dioxide | Numerical | Float | Float | Dependent | Ratio |
| 17 | density | Numerical | String | Float | Dependent | Ratio |
| 18 | pH | Numerical | Float | Float | Dependent | Ratio |
| 19 | sulphates | Numerical | Float | Float | Dependent | Ratio |
| 20 | alcohol | Numerical | String | Float | Dependent | Ratio |
Here we describe different column names
The categories whose actual datatype does not match our required datatypes are listed as follows:
The outliers are calculated as 3 times the std above the mean value from a column.
As we can see in the sample below, there are 20 red wines with a exceptionally high price.
df.describe()
| points | price | fixed_acidity | volatile_acidity | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | pH | sulphates | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2465.000000 | 2196.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 |
| mean | 88.811359 | 24.819217 | 8.352819 | 0.528162 | 2.562353 | 0.087485 | 15.804665 | 47.218661 | 3.311201 | 0.660353 |
| std | 2.974000 | 25.539475 | 1.734847 | 0.178869 | 1.407030 | 0.047043 | 10.379218 | 33.187830 | 0.153927 | 0.172531 |
| min | 81.000000 | 5.000000 | 4.600000 | 0.120000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 2.740000 | 0.330000 |
| 25% | 87.000000 | 12.000000 | 7.100000 | 0.400000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 3.210000 | 0.550000 |
| 50% | 88.000000 | 17.000000 | 7.900000 | 0.520000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 3.310000 | 0.620000 |
| 75% | 91.000000 | 28.000000 | 9.300000 | 0.635000 | 2.600000 | 0.091000 | 21.000000 | 64.000000 | 3.400000 | 0.730000 |
| max | 100.000000 | 450.000000 | 15.900000 | 1.580000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 4.010000 | 2.000000 |
# Check all empty/missing values
df.isnull().sum()
country 0 description 0 designation 153 points 0 price 269 province 0 taster_name 0 title 0 variety 0 winery 0 fixed_acidity 0 volatile_acidity 0 citric_acid 0 residual_sugar 0 chlorides 0 free_sulfur_dioxide 0 total_sulfur_dioxide 0 density 0 pH 0 sulphates 0 alcohol 0 dtype: int64
The outliers from price below are determined by those values that are 3 times above the std of the mean of this column.
# Outliers from price:
price_col = df['price']
outliers = df[price_col > price_col.mean() + 3 * price_col.std()]
outliers
| country | description | designation | points | price | province | taster_name | title | variety | winery | ... | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 15 | Portugal | This is a major Douro wine from the Poças fami... | Símbolo | 94 | 150.0 | Douro | Matt Kettmann | Poças 2014 Símbolo Red (Douro) | Portuguese Red | Poças | ... | 0.360 | 0.34 | 1.8 | 0.075 | 18.0 | 38.0 | 0.9933 | 3.38 | 0.88 | 13.6 |
| 279 | Portugal | One of the iconic wines of Portugal, the brand... | Pêra-Manca | 94 | 275.0 | Alentejo | Michael Schachner | Cartuxa 2011 Pêra-Manca Red (Alentejo) | Portuguese Red | Cartuxa | ... | 0.815 | 0.02 | 2.7 | 0.072 | 17.0 | 34.0 | 0.9955 | 3.58 | 0.89 | 12.3 |
| 296 | Portugal | This top wine from Quinta Vale Dona Maria come... | Vinha do Rio | 95 | 150.0 | Douro | Jim Gordon | Quinta Vale Dona Maria 2012 Vinha do Rio Red (... | Portuguese Red | Quinta Vale Dona Maria | ... | 0.330 | 0.41 | 5.9 | 0.073 | 7.0 | 13.0 | 0.99658 | 3.30 | 0.62 | 12.1 |
| 356 | Portugal | Produced from selected parcels of fruit in top... | Reserva | 92 | 110.0 | Alentejano | Virginie Boone | Cortes de Cima 2012 Reserva Red (Alentejano) | Portuguese Red | Cortes de Cima | ... | 0.340 | 0.27 | 2.8 | 0.067 | 8.0 | 44.0 | 0.99384 | 3.21 | 0.56 | 12 |
| 530 | Portugal | Named after family members called Maria, this ... | Marias da Malhadinha | 92 | 158.0 | Alentejano | Sean P. Sullivan | Herdade da Malhadinha Nova 2007 Marias da Malh... | Portuguese Red | Herdade da Malhadinha Nova | ... | 0.745 | 0.12 | 1.8 | 0.114 | 15.0 | 64.0 | 0.99588 | 3.22 | 0.59 | 9.5 |
| 618 | Portugal | Named after the tower that dominates the build... | Torre do Esporão Garrafeira | 95 | 180.0 | Alentejo | Sean P. Sullivan | Herdade do Esporão 2011 Torre do Esporão Garra... | Portuguese Red | Herdade do Esporão | ... | 0.660 | - | 2.2 | 0.087 | 3.0 | 11.0 | 0.99378 | 3.71 | 0.63 | 12.8 |
| 818 | Portugal | Made from a field blend of old vines to which ... | Referência Grand Reserva | 94 | 110.0 | Douro | Paul Gregutt | Quinta Nova de Nossa Senhora do Carmo 2013 Ref... | Portuguese Red | Quinta Nova de Nossa Senhora do Carmo | ... | 0.330 | 0.49 | 3.4 | 0.093 | 54.0 | 80.0 | 10.002 | 3.30 | 0.76 | 10.7 |
| 833 | Portugal | This is the latest release of what has long be... | Barca-Velha | 100 | 450.0 | Douro | Michael Schachner | Casa Ferreirinha 2008 Barca-Velha Red (Douro) | Portuguese Red | Casa Ferreirinha | ... | 0.540 | 0.34 | 2.5 | 0.076 | 8.0 | 17.0 | 0.99235 | 3.20 | 0.72 | 13.1 |
| 897 | Portugal | Produced from old vines on the schist slopes o... | Unico | 95 | 120.0 | Douro | Jim Gordon | Conceito Vinhos 2015 Unico Red (Douro) | Portuguese Red | Conceito Vinhos | ... | 0.360 | 0.32 | 1.8 | 0.067 | 4.0 | 8.0 | 0.9928 | 3.36 | 0.55 | 12.8 |
| 1078 | Portugal | This is the latest release of Portugal's most ... | Barca Velha | 99 | 426.0 | Douro | Roger Voss | Casa Ferreirinha 2004 Barca Velha Red (Douro) | Portuguese Red | Casa Ferreirinha | ... | 0.260 | 0.54 | 1.9 | 0.083 | 42.0 | 74.0 | 0.99451 | 2.98 | 0.63 | 11.8 |
| 1153 | Portugal | Gorgeously impressive wine, super-rich, balanc... | Vinha da Ponte | 93 | 125.0 | Douro | Anne Krebiehl MW | Quinta do Crasto 2007 Vinha da Ponte Red (Douro) | Portuguese Red | Quinta do Crasto | ... | 0.400 | 0.53 | 6.7 | 0.097 | 6.0 | 19.0 | 0.9986 | 3.27 | 0.82 | 11.7 |
| 1154 | Portugal | From old vines, this powerful wine is the epit... | Adelaide | 93 | 200.0 | Douro | Roger Voss | Quinta do Vallado 2007 Adelaide Red (Douro) | Portuguese Red | Quinta do Vallado | ... | 0.280 | 0.28 | 2.4 | 0.012 | 36.0 | 100.0 | 0.99064 | 3.26 | 0.39 | 11.7 |
| 1271 | Portugal | Young and with great potential,, this is a pow... | CV Curriculum Vitae | 94 | 110.0 | Douro | Roger Voss | Quinta Vale Dona Maria 2014 CV Curriculum Vita... | Portuguese Red | Quinta Vale Dona Maria | ... | 0.370 | 0.44 | 1.6 | 0.038 | 21.0 | 42.0 | 0.99526 | 3.24 | 0.81 | 10.8 |
| 1391 | Portugal | This is a hugely powerful and dense wine, It i... | Grande Reserva | 93 | 110.0 | Douro | Michael Schachner | Quinta Nova de Nossa Senhora do Carmo 2013 Gra... | Portuguese Red | Quinta Nova de Nossa Senhora do Carmo | ... | 0.410 | 0.45 | 6.2 | 0.071 | 6.0 | 14.0 | 0.99702 | 3.21 | 0.49 | 11.8 |
| 1397 | Portugal | Ninety-year-old vines produce this ink-black w... | Vinha Maria Teresa | 95 | 145.0 | Douro | Susan Kostrzewa | Quinta do Crasto 2009 Vinha Maria Teresa Red (... | Portuguese Red | Quinta do Crasto | ... | 0.350 | 0.4 | 3.6 | 0.110 | 12.0 | 24.0 | 0.99549 | 3.23 | 0.70 | 12 |
| 1417 | Portugal | This Ferreirinha Douro Superior wine is made i... | Reserva Especial | 97 | 250.0 | Douro | Roger Voss | Casa Ferreirinha 2007 Reserva Especial Red (Do... | Portuguese Red | Casa Ferreirinha | ... | 0.420 | 0.24 | 2.0 | 0.060 | 19.0 | 50.0 | 0.9917 | 3.72 | 0.74 | 14 |
| 1479 | Portugal | Made only in the best years, this flagship win... | Icon d'Azamor | 92 | 120.0 | Alentejano | Roger Voss | Azamor 2010 Icon d'Azamor Red (Alentejano) | Portuguese Red | Azamor | ... | 0.300 | 0.42 | 2.0 | 0.052 | 6.0 | 24.0 | 0.9963 | 3.44 | 0.82 | 11.9 |
| 1501 | Portugal | One of the iconic wines of Portugal, the brand... | Pêra-Manca | 94 | 275.0 | Alentejo | Michael Schachner | Cartuxa 2011 Pêra-Manca Red (Alentejo) | Portuguese Red | Cartuxa | ... | 0.260 | 0.34 | 2.5 | 0.073 | 16.0 | 47.0 | 0.99594 | 3.40 | 0.78 | 11.3 |
| 1612 | Portugal | As its name suggests this wine comes from vine... | Vinha do Rio | 96 | 149.0 | Douro | Michael Schachner | Quinta Vale Dona Maria 2013 Vinha do Rio Red (... | Portuguese Red | Quinta Vale Dona Maria | ... | 0.350 | 0.53 | 2.6 | 0.070 | 5.0 | 16.0 | 0.9972 | 3.15 | 0.65 | 11 |
| 2021 | Portugal | From an old block of vines, producing a field ... | Vinha Maria Teresa | 94 | 160.0 | Douro | Matt Kettmann | Quinta do Crasto 2011 Vinha Maria Teresa Red (... | Portuguese Red | Quinta do Crasto | ... | 0.330 | 0.38 | 3.3 | 0.063 | 10.0 | 19.0 | 0.99468 | 3.30 | 0.73 | 12 |
20 rows × 21 columns
Here we drop the wines from the dataframe with a higher value of chloride than legally allowed in Australia. In other countries the maximum value varies "When wine contains excess sodium (excess sodium is equal to the content of sodium ions less the content of chloride ions expressed as sodium), it is generally less than 60 mg/L, a limit which may be exceeded in exceptional cases." - https://www.oiv.int/public/medias/2604/oiv-ma-d1-03.pdf
df.drop(df[df.chlorides < 0.606].index)
| country | description | designation | points | price | province | taster_name | title | variety | winery | ... | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 434 | Portugal | This wine has tannins and a solid structure th... | NaN | 86 | 27.0 | Douro | Roger Voss | Quinta do Pessegueiro 2012 Red (Douro) | Portuguese Red | Quinta do Pessegueiro | ... | 0.41 | 0.76 | 1.8 | 0.611 | 8.0 | 45.0 | 0.9968 | 3.06 | 1.26 | 9.4 |
| 1948 | Portugal | The huge concentration of this wine reflects t... | Marquès de Borba Reserva | 87 | NaN | Alentejo | Jim Gordon | J, Portugal Ramos 2003 Marquès de Borba Reserv... | Portuguese Red | J, Portugal Ramos | ... | 0.41 | 0.76 | 1.8 | 0.611 | 8.0 | 45.0 | 0.9968 | 3.06 | 1.26 | 9.4 |
| 2311 | Portugal | Bigode, the mustache, is a soft and juicy wine... | Bigode | 83 | 8.0 | Lisboa | Roger Voss | DFJ Vinhos 2010 Bigode Red (Lisboa) | Portuguese Red | DFJ Vinhos | ... | 0.52 | 1 | 3.4 | 0.610 | 32.0 | 69.0 | 0.9996 | 2.74 | 2.00 | 9.4 |
3 rows × 21 columns
Here we look at how complete and clean the data is.
First we look at all nominal/ordinal values.
df.describe(include = [object])
| country | description | designation | province | taster_name | title | variety | winery | citric_acid | density | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2465 | 2465 | 2312 | 2465 | 2465 | 2465 | 2465 | 2465 | 2465 | 2465 | 2465 |
| unique | 1 | 2281 | 1161 | 30 | 18 | 2268 | 1 | 310 | 80 | 428 | 61 |
| top | Portugal | This is a richly structured wine, With its jui... | Reserva | Douro | Roger Voss | Rui Roboredo Madeira 2012 Castello d'Alba Limi... | Portuguese Red | Wines & Winemakers | - | 0.9976 | 9.5 |
| freq | 2465 | 2 | 171 | 880 | 630 | 3 | 2465 | 84 | 203 | 59 | 234 |
Below we take a look at all the data with discrete and continuous values.
df.describe(exclude = [object])
| points | price | fixed_acidity | volatile_acidity | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | pH | sulphates | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2465.000000 | 2196.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 | 2465.000000 |
| mean | 88.811359 | 24.819217 | 8.352819 | 0.528162 | 2.562353 | 0.087485 | 15.804665 | 47.218661 | 3.311201 | 0.660353 |
| std | 2.974000 | 25.539475 | 1.734847 | 0.178869 | 1.407030 | 0.047043 | 10.379218 | 33.187830 | 0.153927 | 0.172531 |
| min | 81.000000 | 5.000000 | 4.600000 | 0.120000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 2.740000 | 0.330000 |
| 25% | 87.000000 | 12.000000 | 7.100000 | 0.400000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 3.210000 | 0.550000 |
| 50% | 88.000000 | 17.000000 | 7.900000 | 0.520000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 3.310000 | 0.620000 |
| 75% | 91.000000 | 28.000000 | 9.300000 | 0.635000 | 2.600000 | 0.091000 | 21.000000 | 64.000000 | 3.400000 | 0.730000 |
| max | 100.000000 | 450.000000 | 15.900000 | 1.580000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 4.010000 | 2.000000 |
Here we can see the alcohol feature type being string, this is however not very usable. To be able to properly use the alcohol feature we will convert it to float.
This section changes the value types of Alcohol, Density and Citric acid to useable and readable values.
# Prints the values of the alcohol column based on their length as strings
print(df.sort_values(by="alcohol", key=lambda x: x.str.len(), ascending=False).head(15)['alcohol'].to_string())
id 1621 11.066.666.666.666.600 1796 11.066.666.666.666.600 1765 100.333.333.333.333 1671 956.666.666.666.667 1346 100.333.333.333.333 1345 100.333.333.333.333 1767 923.333.333.333.333 1766 100.333.333.333.333 1795 956.666.666.666.667 2029 923.333.333.333.333 678 10.55 1170 11.95 140 10.55 2252 10.75 1286 10.75
Below we we can see that the first value is " . ".
Values like these cannot be converted to numbers, so NaN values will be used.
A lot of these last values are way too big for density values.
These should be around 1.0, so the decimal separators will be placed after the first digit.
# Prints the values of the density
df['density'].value_counts().sort_index()
. 1
0.99007 2
0.9902 1
0.99064 4
0.9908 1
..
100.025 1
100.242 2
100.289 2
100.315 6
100.369 3
Name: density, Length: 428, dtype: int64
Below we can see that the first value is " - " and we will replace this in the next part.
Also these values are strings and we convert these to floats.
# Print values of citric_acid
df['citric_acid'].value_counts().sort_index()
- 203
0.01 47
0.02 76
0.03 41
0.04 49
...
0.75 2
0.76 6
0.78 2
0.79 2
1 1
Name: citric_acid, Length: 80, dtype: int64
# Takes a number as string, removes unnecessary full stops, and adds a decimal separator to the number
def formatalcohol(s):
s = s.replace('.', '')
if s[0] == '1':
s = s[:2] + '.' + s[2:]
else:
s = s[:1] + '.' + s[1:]
return float(s)
# Takes a number as string and places comma at right position
def format_dens_cacid(s):
try:
# Convert all values to floats
s = s.replace('.', '')
s = s[:1] + '.' + s[1:]
s = float(s)
except Exception:
# If value is not a number then make it a NaN
s = np.nan
return s
# Apply the function formatalcohol to each string
df['alcohol'] = df['alcohol'].apply(formatalcohol)
# Apply the function format_dens_cacid to each string
df['density'] = df['density'].apply(format_dens_cacid)
# Apply the function format_dens_cacid to each string
df['citric_acid'] = df['citric_acid'].apply(format_dens_cacid)
df[['alcohol','density','citric_acid']].describe()
# Dropping null/NaN values
df['price'].dropna(inplace = True)
drop_nulls = ['price', 'pH', 'citric_acid']
for null_value in drop_nulls:
df.drop(df.loc[df[null_value].isnull()].index, inplace=True)
Here we extract the numbers from the title and check if this number is a year, we also checked if the amount of wines with a year number in the title is usable.
def digitstoint(ns):
return int("".join([str(n) for n in ns]))
usableyears = 0
def sep_name_year(name):
if any([x.isdigit() for x in name]):
global usableyears
year = digitstoint([int(s) for s in name if s.isdigit()])
if year<10000:
usableyears += 1
return year
else:
return np.nan
else:
return np.nan
df['year'] = df.title.copy()
df['year'] = df['year'].map(sep_name_year)
print(f"There are {usableyears} usable years of the {len(df['title'])} wines")
There are 1980 usable years of the 2013 wines
To create a base outcome for our first research question, we take a look at out target variable: points. We calculate the mean of all given points of the wines. This will be the most likely outcome with our simple model.
baseline_proc = df.points.count() / sum(map(lambda x : x == np.median(df.points), df.points))
baseline_proc
14.482014388489208
Below we see a visual representation of the outliers of the column density.
As we can see, this column has outliers on both the most upper and lower ends.
# Visualisation of the outliers of column "density"
plt.boxplot(df['density'].dropna())
plt.xlabel('Density')
plt.ylabel('Density values')
plt.title('The division of density of the wines')
plt.show()
In the graph below we take a look at the amount of wines per province.
Most wines clearly originate from the first two provinces, Douro and Alentejano.
plt.title('Amount of wines per province')
plt.ylabel('Frequency')
plt.xlabel('Provinces')
df.province.value_counts().plot.bar()
<AxesSubplot:title={'center':'Amount of wines per province'}, xlabel='Provinces', ylabel='Frequency'>
In the graph below we see the amount of wines per alcoholic percentage.
As we can see, the wines are within the range of 9 to 15 percent alcohol.
plt.title('Amount of wines per alcoholic percentage')
plt.xlabel('Percentage of alcohol')
plt.ylabel('Frequency')
df.alcohol.plot.hist()
<AxesSubplot:title={'center':'Amount of wines per alcoholic percentage'}, xlabel='Percentage of alcohol', ylabel='Frequency'>
In the next graph the correlation between density and residual sugar is visualized.
There seems to be a small increase in residual sugar, the denser the wine gets, with a few outliers.
#df.plot.scatter(x='density',y='residual_sugar', c='DarkBlue')
plt.title('Scatterplot of density and residual sugar')
plt.xlabel('Density')
plt.ylabel('Residual sugar')
plt.scatter(df['density'],df['residual_sugar'], alpha=0.1)
<matplotlib.collections.PathCollection at 0x7f9eee43a2e0>
In the upcoming model, we see the correlation between the year of release and the given points.
There are a few outliers, but they don't seem to mean anything.
plt.title('Scatterplot of Year and Points')
plt.xlabel('Year')
plt.xlim((1990, 2020))
plt.ylabel('Points')
plt.scatter(df['year'],df['points'], alpha=0.1)
<matplotlib.collections.PathCollection at 0x7f9eee5a3460>
Just as the graph before, here we see a scatterplot between the year of release and the alcohol percentage of a wine.
There might be a slight increase in alcohol, but it seems negligable.
plt.title('Scatterplot of Year and Alcohol')
plt.xlabel('Year')
plt.xlim((1990, 2020))
plt.ylabel('Alcohol')
plt.scatter(df['year'],df['alcohol'], alpha=0.3)
<matplotlib.collections.PathCollection at 0x7f9eede3fc10>
When we look at the right side of the graph below we can see that when the concentration chlorides rises the average amount of points falls. We can also see in this graph that there are no wines with more than 600mgs of chloride.
plt.title('Scatterplot of Chloride and Points')
plt.xlabel('chlorides')
plt.ylabel('Points')
plt.scatter(df['chlorides'],df['points'], alpha=0.3)
<matplotlib.collections.PathCollection at 0x7f9ee7cdd5e0>
Below we see a histogram of the frequency of points given to a wine, which seems to have the shape of a Gaussian distribution.
df["points"].plot.kde()
plt.axvline(df["points"].mean(), color='g', linewidth=2)
plt.title("Frequency of points given to wines")
plt.xlabel("Points")
df["points"].plot.hist(density=True)
<AxesSubplot:title={'center':'Frequency of points given to wines'}, xlabel='Points', ylabel='Frequency'>
Next we research if there is a correlation between the chemical propperties in a wine and the points it is given.
print(df.corr()['points'])
plt.title('Correlation of price and points')
plt.xlabel('Price')
plt.ylabel('Points')
plt.scatter(df.price, df.points, alpha=0.3)
plt.show()
points 1.000000 price 0.589403 fixed_acidity 0.102792 volatile_acidity -0.341234 citric_acid 0.198073 residual_sugar -0.003594 chlorides -0.129550 free_sulfur_dioxide -0.078386 total_sulfur_dioxide -0.235384 density -0.199112 pH -0.026988 sulphates 0.183103 alcohol 0.481426 year -0.130050 Name: points, dtype: float64
As seen above, there seems to be a slight increase in points given to a wine, when it is more expensive.
# Here we make en temporary dataset without nan values
df_tmp = df.copy()
df_tmp.dropna(inplace=True)
# # Here we drop all the values that are not chemical properties
# X = df_tmp.drop(['points','country','description','designation','price','province','taster_name','title','variety','winery','density'],axis=1)
# y = df_tmp['points']
# # Making test and train set
# X_train,X_test,y_train,y_test=train_test_split(X,y,random_state = 0)
# # Starting learning process
# lr = LinearRegression()
# lr.fit(X_train,y_train)
# y1 = lr.intercept_+lr.coef_[0]*x
# print('a = {:.0f}, b = {:.0f}'.format(lr.intercept_, lr.coef_[0]))
# # Here we test our Model with the RMSE method
# predmpg = lr.predict(X_test)
# np.sqrt(mean_squared_error(y_test,predmpg))
Now we try every possible combination of features and check wich one has the lowest RMSE value. These features will be used for the linear regression model.
# This script tests all possible combination of features
# to check which combination give the highest score for the linear regression
y = df_tmp['points']
# List of features we want to use
features = ['fixed_acidity','volatile_acidity','citric_acid','residual_sugar','chlorides','free_sulfur_dioxide','total_sulfur_dioxide','density','pH','sulphates','alcohol']
#Make list of all combinations of features
possible_combinations = [np.asarray(subset) for l in range(1,len(features)+1) \
for subset in itertools.combinations(features,l) ]
best_RMSE = 50.00
best_comb = possible_combinations[0]
best_formule = [0,0]
for comb in possible_combinations:
X = df_tmp[comb]
# Making test and train set
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state = 0)
# Starting learning process
lr = LinearRegression()
lr.fit(X_train,y_train)
predmpg = lr.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test,predmpg))
if RMSE < best_RMSE:
best_formule[0]=lr.intercept_
best_formule[1]=lr.coef_[0]
best_RMSE = RMSE
best_comb = comb
print(best_comb)
print(best_RMSE)
print(f'y={best_formule[0]}+{best_formule[1]}*x')
['fixed_acidity' 'citric_acid' 'residual_sugar' 'chlorides' 'total_sulfur_dioxide' 'density' 'sulphates' 'alcohol'] 2.4299997464344933 y=286.0336459689146+0.19613248775307712*x
# Using best combination of features
X = df_tmp[best_comb]
# Making test and train set
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state = 0)
# Starting learning process
lr = LinearRegression()
lr.fit(X_train,y_train)
# Check score to
lr.score(X,y)
0.290953751732433
To what extent does the soil on which the grapes are grown have an influence on the different chemical aspects of a given wine?
To research this factor we found a soil test dataset of portugal. In this Dataset there is a longitude and latitude for each test and the outcome (soil type) with it. To make this more useable we edit this datapack down bellow.
#Data reading and cleaning:
geo = pd.read_csv('pt_infosolo.csv',sep=',',encoding ='ISO-8859-1')
geo = geo[geo.year > 1995]
# Creating a copy so that the original dataframe stays in tact.
geo_tmp = geo.copy()
# Dropping rows without a soil value
geo_tmp['soil'].dropna(inplace = True)
geo_tmp.drop(geo_tmp.loc[geo_tmp['soil'].isnull()].index, inplace=True)
#Dropping unnecessary columns
geo_tmp.drop(['X', 'Y', 'pt_infosolo_site_id', 'pt_infosolo_horizon_id',
'profile_code', 'altitude', 'coordinates',
'year', 'qualifier1', 'qualifier2', 'qualifier3', 'wrb_m',
'parent_material', 'land_use', 'hor_top', 'hor_bot', 'hor_name',
'coarse', 'cs', 'fs', 'si', 'c', 'texture_m', 'bd', 'bd_m', 'oc',
'oc_m', 'n', 'n_m', 'p', 'p_m', 'k', 'k_m', 'ph_m', 'caco3',
'caco3_m', 'ca_ex', 'mg_ex', 'k_ex', 'na_ex', 'cations_m', 'cec',
'cec_m', 'v', 'theta_fc', 'theta_wp', 'theta_m', 'institution'], axis=1)
# Removing duplicate coördinates to thin out the amount of markers on the map
geo_tmp.drop_duplicates(subset=['latitude', 'longitude'],keep='last', inplace=True)b
from folium.plugins import MarkerCluster
soil_locations = geo_tmp[["latitude", "longitude", "soil"]]
m = folium.Map(location=[soil_locations.latitude.mean(), soil_locations.longitude.mean()], zoom_start=7, control_scale=True, prefer_canvas=True)
for i, location in soil_locations.iterrows():
folium.Marker([location["latitude"], location["longitude"]], popup=location["soil"]).add_to(m)
m
# Make dict of all provinces and winery in it
df_by_province = df_tmp.groupby('province')
groups = df_by_province.groups
province_wineries = {}
for k, v in groups.items():
wineries = list({df.loc[x]['winery'] for x in v})
province_wineries[k] = wineries
# Making a list af all provinc names
lst_all_provinces = list(province_wineries.keys())
#lst_all_provinces.remove('Portugal')
lst_all_provinces.remove('Portuguese Table Wine')
#lst_all_provinces.remove('Table wine')
print(lst_all_provinces)
['Alenquer', 'Alentejano', 'Alentejo', 'Algarve', 'Bairrada', 'Beira Atlantico', 'Beira Interior', 'Beiras', 'Douro', 'Duriense', 'Dão', 'Estremadura', 'Lisboa', 'Minho', 'Obidos', 'Palmela', 'Península de Setúbal', 'Ribatejano', 'Ribatejo', 'Setubal', 'Tejo', 'Terras do Dão', 'Terras do Sado', 'Trás-os-Montes', 'Vinho Verde']
Now we can test if the longitude and the latitude are useable. We use the google maps api for this.
gmaps = googlemaps.Client(key='AIzaSyAGB87IpIn12wgutDClLs3sn65B-HMxVHM')
la_lo = (str(geo['latitude'].iloc[1]) + "," + str(geo['longitude'].iloc[1]))
gmaps.reverse_geocode(la_lo)
[{'address_components': [{'long_name': 'VXC2+H9',
'short_name': 'VXC2+H9',
'types': ['plus_code']},
{'long_name': 'Colmeal',
'short_name': 'Colmeal',
'types': ['locality', 'political']},
{'long_name': 'Guarda',
'short_name': 'Guarda',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']},
{'long_name': '6440',
'short_name': '6440',
'types': ['postal_code', 'postal_code_prefix']}],
'formatted_address': 'VXC2+H9 Colmeal, Portugal',
'geometry': {'bounds': {'northeast': {'lat': 40.8715,
'lng': -7.048999999999999},
'southwest': {'lat': 40.871375, 'lng': -7.049125}},
'location': {'lat': 40.87148000000001, 'lng': -7.049093},
'location_type': 'ROOFTOP',
'viewport': {'northeast': {'lat': 40.8727864802915,
'lng': -7.047713519708498},
'southwest': {'lat': 40.8700885197085, 'lng': -7.050411480291503}}},
'place_id': 'GhIJHo8ZqIxvREARodl1b0UyHMA',
'plus_code': {'compound_code': 'VXC2+H9 Colmeal, Portugal',
'global_code': '8CGJVXC2+H9'},
'types': ['plus_code']},
{'address_components': [{'long_name': '18',
'short_name': '18',
'types': ['street_number']},
{'long_name': 'Travessa das Eiras',
'short_name': 'Tv. das Eiras',
'types': ['route']},
{'long_name': 'Guarda',
'short_name': 'Guarda',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']},
{'long_name': '6440-221',
'short_name': '6440-221',
'types': ['postal_code']}],
'formatted_address': 'Tv. das Eiras 18, 6440-221, Portugal',
'geometry': {'location': {'lat': 40.8719836, 'lng': -7.0521748},
'location_type': 'ROOFTOP',
'viewport': {'northeast': {'lat': 40.8733325802915,
'lng': -7.050825819708497},
'southwest': {'lat': 40.8706346197085, 'lng': -7.053523780291502}}},
'place_id': 'ChIJj3QuB_1iPA0RfkRJddpJIL4',
'plus_code': {'compound_code': 'VWCX+Q4 Colmeal, Portugal',
'global_code': '8CGJVWCX+Q4'},
'types': ['street_address']},
{'address_components': [{'long_name': '7',
'short_name': '7',
'types': ['street_number']},
{'long_name': 'Rua da Amoreira',
'short_name': 'R. da Amoreira',
'types': ['route']},
{'long_name': 'Guarda',
'short_name': 'Guarda',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']},
{'long_name': '6440-221',
'short_name': '6440-221',
'types': ['postal_code']}],
'formatted_address': 'R. da Amoreira 7, 6440-221, Portugal',
'geometry': {'location': {'lat': 40.8722761, 'lng': -7.0519305},
'location_type': 'RANGE_INTERPOLATED',
'viewport': {'northeast': {'lat': 40.8736250802915,
'lng': -7.050581519708498},
'southwest': {'lat': 40.8709271197085, 'lng': -7.053279480291502}}},
'place_id': 'EhpSLiBkYSBBbW9yZWlyYSA3LCBQb3J0dWdhbCIaEhgKFAoSCSWpZg_9YjwNES3bW1j45NLsEAc',
'types': ['street_address']},
{'address_components': [{'long_name': 'Rua Infante Dom Henrique',
'short_name': 'Rua Infante Dom Henrique',
'types': ['route']},
{'long_name': 'Guarda',
'short_name': 'Guarda',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']},
{'long_name': '6440',
'short_name': '6440',
'types': ['postal_code', 'postal_code_prefix']}],
'formatted_address': 'Rua Infante Dom Henrique, 6440, Portugal',
'geometry': {'bounds': {'northeast': {'lat': 40.8854816, 'lng': -7.0392675},
'southwest': {'lat': 40.8726554, 'lng': -7.0514133}},
'location': {'lat': 40.8794175, 'lng': -7.046055},
'location_type': 'GEOMETRIC_CENTER',
'viewport': {'northeast': {'lat': 40.8854816, 'lng': -7.0392675},
'southwest': {'lat': 40.8726554, 'lng': -7.0514133}}},
'place_id': 'ChIJY-t8YBljPA0RdCrje7FZEyE',
'types': ['route']},
{'address_components': [{'long_name': 'Penha de Águia',
'short_name': 'Penha de Águia',
'types': ['administrative_area_level_3', 'political']},
{'long_name': 'Guarda District',
'short_name': 'Guarda District',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']},
{'long_name': '6440',
'short_name': '6440',
'types': ['postal_code', 'postal_code_prefix']}],
'formatted_address': 'Penha de Águia, 6440, Portugal',
'geometry': {'bounds': {'northeast': {'lat': 40.8935343, 'lng': -7.0272634},
'southwest': {'lat': 40.8391825, 'lng': -7.109014}},
'location': {'lat': 40.8584345, 'lng': -7.065978299999999},
'location_type': 'APPROXIMATE',
'viewport': {'northeast': {'lat': 40.8935343, 'lng': -7.0272634},
'southwest': {'lat': 40.8391825, 'lng': -7.109014}}},
'place_id': 'ChIJXQguo4piPA0RUMiQ5L3rAAU',
'types': ['administrative_area_level_3', 'political']},
{'address_components': [{'long_name': '6440',
'short_name': '6440',
'types': ['postal_code', 'postal_code_prefix']},
{'long_name': 'Guarda District',
'short_name': 'Guarda District',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']}],
'formatted_address': '6440, Portugal',
'geometry': {'bounds': {'northeast': {'lat': 41.0364111, 'lng': -6.8001727},
'southwest': {'lat': 40.7485076, 'lng': -7.122245899999999}},
'location': {'lat': 40.8696841, 'lng': -6.9638624},
'location_type': 'APPROXIMATE',
'viewport': {'northeast': {'lat': 41.0364111, 'lng': -6.8001727},
'southwest': {'lat': 40.7485076, 'lng': -7.122245899999999}}},
'place_id': 'ChIJy1DYWSBlPA0RcLk7DsHrABw',
'types': ['postal_code', 'postal_code_prefix']},
{'address_components': [{'long_name': 'Figueira de Castelo Rodrigo',
'short_name': 'Figueira de Castelo Rodrigo',
'types': ['administrative_area_level_2', 'political']},
{'long_name': 'Guarda District',
'short_name': 'Guarda District',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']},
{'long_name': '6440',
'short_name': '6440',
'types': ['postal_code', 'postal_code_prefix']}],
'formatted_address': 'Figueira de Castelo Rodrigo, 6440, Portugal',
'geometry': {'bounds': {'northeast': {'lat': 41.0364125,
'lng': -6.800155200000001},
'southwest': {'lat': 40.7485095, 'lng': -7.122247199999999}},
'location': {'lat': 40.8696841, 'lng': -6.9638624},
'location_type': 'APPROXIMATE',
'viewport': {'northeast': {'lat': 41.0364125, 'lng': -6.800155200000001},
'southwest': {'lat': 40.7485095, 'lng': -7.122247199999999}}},
'place_id': 'ChIJy1DYWSBlPA0R0DSQ5L3rAAQ',
'types': ['administrative_area_level_2', 'political']},
{'address_components': [{'long_name': 'Guarda District',
'short_name': 'Guarda District',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']}],
'formatted_address': 'Guarda District, Portugal',
'geometry': {'bounds': {'northeast': {'lat': 41.1793516,
'lng': -6.780960599999999},
'southwest': {'lat': 40.2299187, 'lng': -7.8492267}},
'location': {'lat': 40.5385972, 'lng': -7.267577199999999},
'location_type': 'APPROXIMATE',
'viewport': {'northeast': {'lat': 41.1793516, 'lng': -6.780960599999999},
'southwest': {'lat': 40.2299187, 'lng': -7.8492267}}},
'place_id': 'ChIJsXnCIOfsPA0RoCyQ5L3rAAM',
'types': ['administrative_area_level_1', 'political']},
{'address_components': [{'long_name': 'Portugal',
'short_name': 'PT',
'types': ['country', 'political']}],
'formatted_address': 'Portugal',
'geometry': {'bounds': {'northeast': {'lat': 42.1543111,
'lng': -6.189159200000001},
'southwest': {'lat': 32.2895, 'lng': -31.4647999}},
'location': {'lat': 39.39987199999999, 'lng': -8.224454},
'location_type': 'APPROXIMATE',
'viewport': {'northeast': {'lat': 42.1543111, 'lng': -6.189159200000001},
'southwest': {'lat': 32.2895, 'lng': -31.4647999}}},
'place_id': 'ChIJ1SZCvy0kMgsRQfBOHAlLuCo',
'types': ['country', 'political']}]
Here we can see that there is allot of info to get from these requests so now we get all the info for each test in the geo test database
locations={}
for i in range(0,len(geo_tmp)):
la_lo = (str(geo['latitude'].iloc[i]) + "," + str(geo['longitude'].iloc[i]))
full_loc = gmaps.reverse_geocode(la_lo)
for x in range(len(full_loc)-1):
locations[f'{i}_{x}'] = gmaps.reverse_geocode(la_lo)[x]
locations[f'{i}_{x}']['latitude'] = geo['latitude'].iloc[i]
locations[f'{i}_{x}']['longitude'] = geo['longitude'].iloc[i]
locations[f'{i}_{x}'].pop('types')
print(len(locations))
big_lo_la_df = pd.DataFrame.from_dict(locations).transpose()
big_lo_la_df
10919
| address_components | formatted_address | geometry | place_id | plus_code | latitude | longitude | |
|---|---|---|---|---|---|---|---|
| 0_0 | [{'long_name': 'VXC2+H9', 'short_name': 'VXC2+... | VXC2+H9 Colmeal, Portugal | {'bounds': {'northeast': {'lat': 40.8715, 'lng... | GhIJHo8ZqIxvREARodl1b0UyHMA | {'compound_code': 'VXC2+H9 Colmeal, Portugal',... | 40.87148 | -7.049093 |
| 0_1 | [{'long_name': '18', 'short_name': '18', 'type... | Tv. das Eiras 18, 6440-221, Portugal | {'location': {'lat': 40.8719836, 'lng': -7.052... | ChIJj3QuB_1iPA0RfkRJddpJIL4 | {'compound_code': 'VWCX+Q4 Colmeal, Portugal',... | 40.87148 | -7.049093 |
| 0_2 | [{'long_name': '7', 'short_name': '7', 'types'... | R. da Amoreira 7, 6440-221, Portugal | {'location': {'lat': 40.8722761, 'lng': -7.051... | EhpSLiBkYSBBbW9yZWlyYSA3LCBQb3J0dWdhbCIaEhgKFA... | NaN | 40.87148 | -7.049093 |
| 0_3 | [{'long_name': 'Rua Infante Dom Henrique', 'sh... | Rua Infante Dom Henrique, 6440, Portugal | {'bounds': {'northeast': {'lat': 40.8854816, '... | ChIJY-t8YBljPA0RdCrje7FZEyE | NaN | 40.87148 | -7.049093 |
| 0_4 | [{'long_name': 'Penha de Águia', 'short_name':... | Penha de Águia, 6440, Portugal | {'bounds': {'northeast': {'lat': 40.8935343, '... | ChIJXQguo4piPA0RUMiQ5L3rAAU | NaN | 40.87148 | -7.049093 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1537_5 | [{'long_name': 'Canedo', 'short_name': 'Canedo... | Canedo, 4870 Canedo, Portugal | {'bounds': {'northeast': {'lat': 41.6523853, '... | ChIJmT6_DBArOw0RICSR5L3rAAU | NaN | 41.62 | -7.733886 |
| 1537_6 | [{'long_name': 'Canedo', 'short_name': 'Canedo... | 4870 Canedo, Portugal | {'bounds': {'northeast': {'lat': 41.6523853, '... | ChIJX--pg1MqOw0R35bFqXBmiPg | NaN | 41.62 | -7.733886 |
| 1537_7 | [{'long_name': '4870', 'short_name': '4870', '... | 4870, Portugal | {'bounds': {'northeast': {'lat': 41.6523901, '... | ChIJWYD8f93SJA0RcMs7DsHrABw | NaN | 41.62 | -7.733886 |
| 1537_8 | [{'long_name': 'Ribeira de Pena', 'short_name'... | Ribeira de Pena, 4870, Portugal | {'bounds': {'northeast': {'lat': 41.6523853, '... | ChIJr4KOfMvSJA0RJzvxamRZJO4 | NaN | 41.62 | -7.733886 |
| 1537_9 | [{'long_name': 'Vila Real District', 'short_na... | Vila Real District, Portugal | {'bounds': {'northeast': {'lat': 41.9270957, '... | ChIJrS32IGbeOg0RjsQ8ybgq0dc | NaN | 41.62 | -7.733886 |
10919 rows × 7 columns
The data is not verry clean so we save everything from every different ground test_location i an dataframe. So we can loop through every piece of adress and name there is and see of there is a province name in it.
province_soil = {}
for p in lst_all_provinces:
for i in range(len(list(big_lo_la_df.formatted_address))):
#print(type(big_lo_la_df.iloc[i].formatted_address))
if p in big_lo_la_df.iloc[i].formatted_address:
#print(big_lo_la_df.iloc[i].formatted_address)
if p in province_soil.keys():
province_soil[p].append(geo_tmp['soil'][(geo_tmp.latitude == big_lo_la_df.iloc[i].latitude) & (geo_tmp.longitude == big_lo_la_df.iloc[i].longitude)].values[0])
else:
province_soil[p] = []
province_soil[p].append(geo_tmp['soil'][(geo_tmp.latitude == big_lo_la_df.iloc[i].latitude) & (geo_tmp.longitude == big_lo_la_df.iloc[i].longitude)].values[0])
for x in range(len(big_lo_la_df.iloc[i].address_components)):
for key,value in big_lo_la_df.iloc[i].address_components[x].items():
#print(value)
if p in value:
#print(p,geo_tmp['soil'][(geo_tmp.latitude == big_lo_la_df.iloc[i].latitude) & (geo_tmp.longitude == big_lo_la_df.iloc[i].longitude)])
#print(p,big_lo_la_df.iloc[i].longitude,big_lo_la_df.iloc[i].latitude)
if p in province_soil.keys():
province_soil[p].append(geo_tmp['soil'][(geo_tmp.latitude == big_lo_la_df.iloc[i].latitude) & (geo_tmp.longitude == big_lo_la_df.iloc[i].longitude)].values[0])
else:
province_soil[p] = []
province_soil[p].append(geo_tmp['soil'][(geo_tmp.latitude == big_lo_la_df.iloc[i].latitude) & (geo_tmp.longitude == big_lo_la_df.iloc[i].longitude)].values[0])
province_soil
{'Bairrada': ['Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols'],
'Douro': ['Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Anthrosols',
'Leptosols',
'Leptosols',
'Leptosols',
'Leptosols',
'Leptosols',
'Leptosols'],
'Dão': ['Leptosols', 'Leptosols', 'Leptosols'],
'Minho': ['Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols',
'Acrisols'],
'Tejo': ['Luvisols',
'Luvisols',
'Luvisols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Planosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Regosols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Cambisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Alisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols',
'Luvisols']}
Now we can make a list of the provinces with a soil test and what soil type did come most out of the test for this prvince.
copy_province_soil = province_soil.copy()
for key, val in copy_province_soil.items():
copy_province_soil[key] = max(set(val), key=val.count)
copy_province_soil
{'Bairrada': 'Acrisols',
'Douro': 'Anthrosols',
'Dão': 'Leptosols',
'Minho': 'Acrisols',
'Tejo': 'Cambisols'}
#A basic function to replace the province name with a Soil type
def replace_province(x):
return copy_province_soil[x]
Now we can filter every wine that has a province with a known soil type.
# Here we make en temporary dataset without nan values
# ['fixed_acidity','volatile_acidity','citric_acid','residual_sugar','chlorides','free_sulfur_dioxide','total_sulfur_dioxide','density','sulphates','alcohol']
df_with_soilType= df_tmp.drop(['points','year','country','description','designation','price','taster_name','title','variety','winery'],axis=1)
#X.set_index('province')
print(len(df_with_soilType))
df_with_soilType = df_with_soilType[df_with_soilType['province'].isin(copy_province_soil.keys())]
print(len(df_with_soilType))
df_with_soilType['soil'] = df_with_soilType['province'].copy()
df_with_soilType['soil'] = df_with_soilType['soil'].map(replace_province)
df_with_soilType.head()
1857 1050
| province | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | soil | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||
| 1 | Douro | 10.0 | 0.29 | 0.40 | 2.9 | 0.098 | 10.0 | 26.0 | 1.00060 | 3.48 | 0.91 | 9.7 | Anthrosols |
| 7 | Tejo | 10.8 | 0.47 | 0.43 | 2.1 | 0.171 | 27.0 | 66.0 | 0.99820 | 3.17 | 0.76 | 10.8 | Cambisols |
| 11 | Douro | 8.8 | 0.27 | 0.46 | 2.1 | 0.095 | 20.0 | 29.0 | 0.99488 | 3.26 | 0.56 | 11.3 | Anthrosols |
| 13 | Douro | 9.3 | 0.27 | 0.41 | 2.0 | 0.091 | 6.0 | 16.0 | 0.99800 | 3.28 | 0.70 | 9.7 | Anthrosols |
| 15 | Douro | 7.4 | 0.36 | 0.34 | 1.8 | 0.075 | 18.0 | 38.0 | 0.99330 | 3.38 | 0.88 | 13.6 | Anthrosols |
means = {}
for soil in copy_province_soil.values():
#if soil in means.keys():
means[soil] = df_with_soilType[df_with_soilType.soil == soil].describe().loc['mean']
means = pd.DataFrame(means).transpose()
means
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Acrisols | 8.141818 | 0.504455 | 0.286182 | 2.427273 | 0.086436 | 18.309091 | 59.654545 | 0.996791 | 3.325273 | 0.657091 | 10.210909 |
| Anthrosols | 8.541592 | 0.499212 | 0.313228 | 2.612462 | 0.087715 | 15.307057 | 46.743243 | 0.996857 | 3.292267 | 0.681517 | 10.503303 |
| Leptosols | 8.738060 | 0.489925 | 0.329403 | 2.857836 | 0.086403 | 17.003731 | 49.507463 | 0.996916 | 3.277463 | 0.660896 | 10.659701 |
| Cambisols | 8.474359 | 0.531487 | 0.297795 | 2.783590 | 0.086903 | 16.320513 | 52.533333 | 0.997148 | 3.291846 | 0.652359 | 10.241795 |
sns.pairplot(df_with_soilType,hue='soil')
<seaborn.axisgrid.PairGrid at 0x7f9eee08c520>
The Pairplot is to find out if one of the combinations of chemicals has a verry seperating factor on the colours of the provinces and herb the soiltypes, but this is not the case so the conclusion is that there is little effect on the wine when the soil is different. Although there are a few things that can be done better:
Here we cluster our dataset using different discrete/continuous features as the axes.
We have tried many different combinations, of which almost all produce uninteresting clusterings.
Below is our methodology, along with a few examples of uninteresting clusterings.
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
def show_clusters(x_axis, y_axis):
x_set = (df[x_axis]-df[x_axis].mean())/df[x_axis].std()
y_set = (df[y_axis]-df[y_axis].mean())/df[y_axis].std()
xy = list(zip(x_set, y_set))
km = KMeans(n_clusters= 3)
km.fit(xy)
cluster = km.predict(xy)
plt.scatter(x_set, y_set, c=cluster)
style = lambda s: "pH" if s == "pH" else s.replace("_", " ").capitalize()
plt.title(f"{style(x_axis)} vs {style(y_axis)} scatterplot")
plt.xlabel(style(x_axis))
plt.ylabel(style(y_axis))
plt.show()
plt.close()
show_clusters("free_sulfur_dioxide", "total_sulfur_dioxide")
show_clusters("alcohol", "price")
show_clusters("pH", "sulphates")
Interesting clustering:
One interesting clustering we found is when using the features "Residual sugar" and "Chlorides".
This clustering seems to imply some sort of correlation, where the following statements hold.
show_clusters("residual_sugar", "chlorides")